Ramesh Gopinath's Unsupervised Learning AIML Project

Part A - Automobile

1. Data Understanding and Exploration

1F. Insights: there is multicollinearity among the independent variables.

Miles per gallon (mpg) is the dependent variable and all the others are independent variables.

HP is also a numerical column, but it does not appear in describe() because it is stored as an object dtype, which indicates it contains some non-numeric (missing) values.
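This can be confirmed with a coercion check; a sketch on a hypothetical miniature frame (the real placeholder character may differ):

```python
import pandas as pd

# Hypothetical miniature frame: 'HP' reads as object dtype because a
# placeholder such as '?' hides among otherwise numeric values.
df = pd.DataFrame({"HP": ["130", "165", "?", "150"]})
print(df["HP"].dtype)  # object, so describe() skips it among numeric columns

# Coerce to numeric; invalid entries become NaN, exposing the missing values.
df["HP"] = pd.to_numeric(df["HP"], errors="coerce")
print(df["HP"].isna().sum())  # 1
```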

Displacement & Cylinders are highly positively correlated ........ 0.95
Weight & Cylinders are highly positively correlated .............. 0.90
Weight & Displacement are highly positively correlated ........... 0.93

Weight & Miles per gallon are negatively correlated .............. -0.83
Displacement & Miles per gallon are also negatively correlated ... -0.80
Cylinders & Miles per gallon are also negatively correlated ...... -0.78
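Figures like these come straight from `DataFrame.corr()`; a small synthetic sketch (column names assumed to mirror the dataset, values invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
disp = rng.normal(200.0, 50.0, 500)

# Synthetic stand-in: cylinders rises with displacement, mpg falls with it.
df = pd.DataFrame({
    "displacement": disp,
    "cylinders": disp / 40.0 + rng.normal(0.0, 0.4, 500),
    "mpg": 40.0 - disp / 10.0 + rng.normal(0.0, 2.0, 500),
})

corr = df.corr()
print(corr.round(2))  # strong positive off-diagonal for displacement/cylinders,
                      # strong negative ones against mpg
```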

2. Data preparation and Analysis

2A. Feature-wise percentage of missing values
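The feature-wise percentage reduces to one line of pandas; a sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                   "b": [np.nan, np.nan, 3.0, 4.0]})

# isnull() gives a boolean mask; its column-wise mean is the missing fraction.
missing_pct = df.isnull().mean() * 100
print(missing_pct)  # a: 25.0, b: 50.0
```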

3. Clustering

Part B - Domain Automobile

1. Data Understanding and Cleaning

There are 846 rows in total, but some columns report a lower count because they are missing a few values:

  1. circularity = 841 - 5 missing values
  2. distance_circularity = 842 - 4 missing values
  3. radius_ratio = 840 - 6 missing values
  4. pr.axis_aspect_ratio = 844 - 2 missing values
  5. scatter_ratio = 845 - 1 missing value
  6. elongatedness = 845 - 1 missing value
  7. pr.axis_rectangular = 843 - 3 missing values
  8. scaled_variance = 843 - 3 missing values
  9. scaled_variance1 = 844 - 2 missing values
  10. scaled_radius_of_gyration = 844 - 2 missing values
  11. scaled_radius_of_gyration.1 = 842 - 4 missing values
  12. skewness_about = 840 - 6 missing values
  13. skewness_about.1 = 845 - 1 missing value
  14. skewness_about.2 = 845 - 1 missing value

The following non-object columns seem to have no missing values:

  1. compactness
  2. max.length_aspect_ratio
  3. max.length_regularity
  4. hollows_ratio

That accounts for 18 of the 19 columns.

The Class column doesn't seem to have any missing values.

There are 14 columns flagged "True" by the missing-value check, which matches the 14 columns with missing values listed earlier.

There are no duplicate rows in the dataset.
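The checks above boil down to three pandas calls; a minimal sketch on an invented two-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"circularity": [40.0, np.nan, 44.0, 41.0],
                   "compactness": [95, 91, 104, 99]})

has_missing = df.isnull().any()    # True/False flag per column
missing_cnt = df.isnull().sum()    # per-column missing-value counts
dup_rows = df.duplicated().sum()   # number of duplicate rows

print(has_missing["circularity"], missing_cnt["circularity"], dup_rows)
```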

2. Data Preparation

3. Model Building

The following features have outliers:

  1. radius_ratio - a few outliers
  2. pr.axis_aspect_ratio - many outliers
  3. max.length_aspect_ratio - a very large number of outliers
  4. scaled_variance - a few outliers
  5. scaled_variance.1 - a few outliers
  6. scaled_radius_of_gyration.1 - many outliers
  7. skewness_about - a few outliers
  8. skewness_about.1 - 1 outlier
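Outlier counts like these typically come from the 1.5×IQR boxplot rule; a sketch on a made-up series:

```python
import pandas as pd

s = pd.Series([10, 11, 12, 11, 10, 12, 11, 50])  # 50 is the lone extreme value

# Boxplot rule: flag points beyond 1.5 * IQR outside the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
print(int(mask.sum()))  # 1 outlier flagged
```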

3H. Insights on the SVC with 5 PCA components

It is observed that after applying dimensionality reduction to 5 PCA components, both the train score and the test score have dropped:

Train score with 5 PCA components: 0.6879
Test score with 5 PCA components: 0.6706

The first SVC had given the following scores:

Train score without PCA, 18 original features: 0.9586
Test score without PCA, 18 original features: 0.9529

So even though the accuracy scores have dropped after applying 5 PCA components, the train-test gap remains small (0.6879 vs. 0.6706), so the model still generalizes without overfitting.
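The comparison can be reproduced with a scaler + PCA + SVC pipeline; sketched here on sklearn's wine data as a stand-in for the vehicle dataset (exact scores will differ):

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# SVC on all original features vs. SVC on 5 principal components.
full = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)
pca5 = make_pipeline(StandardScaler(), PCA(n_components=5), SVC()).fit(X_tr, y_tr)

print("all features:", full.score(X_tr, y_tr), full.score(X_te, y_te))
print("5 components:", pca5.score(X_tr, y_tr), pca5.score(X_te, y_te))
```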

4. Performance improvement

5. Data Understanding and Cleaning

5A. Prerequisites and assumptions of PCA

Prerequisites/Assumptions of Principal Component Analysis

  1. PCA assumes that the principal components with high variance carry the signal and that components with low variance can be disregarded as noise. PCA originated from the Pearson correlation coefficient framework, where it was first assumed that only the axes of highest variance would be turned into principal components.

  2. All variables should be assessed on the same (interval or ratio) level of measurement. A commonly preferred norm is at least 150 observations in the sample, with a cases-to-variables ratio of at least 5:1.

  3. Extreme values that deviate from the other data points (outliers) should be few. A large number of outliers may represent experimental errors and will degrade the ML model/algorithm.

  4. There should be sampling adequacy, which simply means that for PCA to produce a reliable result, a large enough sample size is required.

  5. The data should be suitable for reduction: the variables need adequate correlations among themselves in order to be reduced to a smaller number of components.
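The last point can be eyeballed straight from the correlation matrix; a hypothetical sketch with two related variables and one unrelated one:

```python
import numpy as np

rng = np.random.default_rng(3)
base = rng.normal(size=300)

# Two variables share a common factor; the third is pure noise.
X = np.column_stack([base + rng.normal(scale=0.5, size=300),
                     base + rng.normal(scale=0.5, size=300),
                     rng.normal(size=300)])

corr = np.corrcoef(X, rowvar=False)
off_diag = np.abs(corr[np.triu_indices(3, k=1)])
print(off_diag.round(2))  # only the first pair is strongly correlated,
                          # so PCA would mostly merge those two variables
```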

5B. Advantages and Limitations of PCA

Advantages of PCA

a) Removes multicollinearity - Correlation between independent variables (multicollinearity) hurts machine-learning performance. PCA combines highly correlated variables into a set of uncorrelated, orthogonal principal components, effectively eliminating multicollinearity between features.

b) Decreases computational time - Since PCA reduces dimensionality by replacing a huge number of independent variables with a smaller set of uncorrelated principal components, training and testing the ML model take less time.

c) Helps to reduce overfitting - Overfitting mainly occurs when there are too many variables in the dataset. So, PCA helps in overcoming the overfitting issue by reducing the number of features.

d) Improves visualization - PCA transforms a high-dimensional dataset to low-dimensional data (e.g. 2 dimensions) so that it can be visualized easily. A scree plot can be used to see which principal components capture the most variance and therefore have more impact than the others.

e) Improves algorithm performance - With very many features, algorithm performance degrades drastically. PCA speeds up a machine-learning algorithm by getting rid of correlated variables that don't contribute to decision making. Training time reduces significantly with fewer features, so if the input dimensionality is too high, using PCA to speed up the algorithm is a reasonable choice.
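Advantage (a) can be seen numerically: the component scores come out uncorrelated even when the inputs are highly collinear (a synthetic sketch):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)

# Features 0 and 1 are nearly collinear; feature 2 is independent noise.
X = np.column_stack([x1,
                     0.9 * x1 + rng.normal(scale=0.2, size=200),
                     rng.normal(size=200)])
print(np.corrcoef(X, rowvar=False)[0, 1])  # close to 1

Z = PCA().fit_transform(X)
cross = np.corrcoef(Z, rowvar=False)
print(np.allclose(cross, np.eye(3), atol=1e-8))  # components are uncorrelated
```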

Limitations / Disadvantages of PCA

a) Data normalization is a must before applying PCA - The dataset must be standardized before implementing PCA; otherwise features on larger scales dominate the variance and it is difficult to find the optimal principal components.

b) Results in some information loss - Due to the treatment of outliers and the dimensionality reduction itself, there can be some information loss. Although the principal components try to cover the maximum variance among the features, if the number of components is not selected with care, some information may be missed compared to the original list of features.

c) Independent variables become less interpretable - After implementing PCA, the original features are turned into principal components. Principal components are linear combinations of the original features and are not as readable and interpretable as the originals.

d) It is not suitable for small datasets, since with too few observations the estimated components become unreliable.
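Limitation (c) reflects the fact that each component score is just a weighted sum of the original features, with weights given by `components_`; this can be verified directly (a sketch on random data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = StandardScaler().fit_transform(rng.normal(size=(100, 4)))

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)

# Each PC score is a linear combination of the centered original features,
# with weights (loadings) taken from pca.components_.
Z_manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(Z, Z_manual))  # True
```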